support qwen2 hf<->mcore ckpt converter #1290
base: main
Conversation
Hi @wenyujin333, could you please rebase the MR onto the main branch to resolve the conflicts? Thanks!
Thanks for the great work and contribution to Megatron-LM.
I think there are a few aspects that could help get this MR merged into MCore:
- It would be very helpful for users to add a documentation section introducing the workflow for using the HF<->MCore converter for Qwen models, similar to https://github.com/NVIDIA/Megatron-LM/tree/main/examples/mixtral
- From the MCore developers' point of view, some of the code is difficult to maintain because it is complex and not easy to follow, for example in saver_qwen2_hf.py. Perhaps we could restructure some sections to make the logic flow more apparent (see the sketch below this list).
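As an illustration of the kind of restructuring meant above, the per-weight resharding in the saver could be pulled into small, named helpers instead of being inlined for every tensor. This is only a minimal sketch, not code from this MR: the helper names are made up, and it assumes PyTorch and that each split dimension is divisible by the TP size.

```python
import torch

def shard_for_tp(weight, tp_size, dim):
    """Split a full (gathered) weight into equal per-TP-rank shards along `dim`."""
    return torch.chunk(weight, tp_size, dim=dim)

def refuse_swiglu_fc1(w, v, tp_size):
    """Re-fuse the W and V halves of a SwiGLU fc1 so that each TP rank
    receives its own [W_rank; V_rank] block stacked on dim 0."""
    w_shards = torch.chunk(w, tp_size, dim=0)
    v_shards = torch.chunk(v, tp_size, dim=0)
    return [torch.cat([ws, vs], dim=0) for ws, vs in zip(w_shards, v_shards)]
```

With helpers like these, the saver's per-layer loop mostly reduces to choosing the split dimension for each weight, which is easier to review and maintain.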
tools/checkpoint/loader_mcore.py (outdated)
```python
# Dense modules
for tp_rank, model in enumerate(models[0]):
    layer = get_transformer_block(model).layers[layer_num]
    qkv_weight.append(layer.self_attention.linear_qkv.weight.data)
    dense_weight.append(layer.self_attention.linear_proj.weight.data)
    if md.linear_bias:
        qkv_bias.append(layer.self_attention.linear_qkv.bias.data)
    elif md.add_qkv_bias:
        qkv_bias.append(layer.self_attention.linear_qkv.bias.data)
    shared_expert_mlp_l0_weight.append(layer.mlp.shared_experts.linear_fc1.weight.data)
    shared_expert_mlp_l1_weight.append(layer.mlp.shared_experts.linear_fc2.weight.data)

layer = get_transformer_block(models[0][0]).layers[layer_num]
router_weight = layer.mlp.router.weight.data
shared_expert_gate_weight = layer.mlp.shared_experts.gate_weight.data

# MoE modules
num_experts_per_rank = margs.num_experts // ep_size
for ep_rank, tp_models in enumerate(models):
    for tp_rank, model in enumerate(tp_models):
        layer = get_transformer_block(model).layers[layer_num]
        for local_expert_idx in range(num_experts_per_rank):
            expert_idx = int(ep_rank * num_experts_per_rank + local_expert_idx)
            mlp_l0_weight_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc1.weight.data)
            mlp_l1_weight_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc2.weight.data)
            if md.linear_bias:
                mlp_l0_bias_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc1.bias.data)

    if md.linear_bias:
        # Get non-parallel tensors from tp_rank 0
        layer = get_transformer_block(tp_models[0]).layers[layer_num]
        for local_expert_idx in range(num_experts_per_rank):
            expert_idx = int(ep_rank * num_experts_per_rank + local_expert_idx)
            mlp_l1_bias_list[expert_idx].append(layer.mlp.experts.local_experts[local_expert_idx].linear_fc2.bias.data)

mlp_l0_weight_w_list = [[] for _ in range(margs.num_experts)]
mlp_l0_weight_v_list = [[] for _ in range(margs.num_experts)]
# Concat along the tensor parallel dimension
for expert_idx in range(margs.num_experts):
    mlp_l0_weight = mlp_l0_weight_list[expert_idx]
    if md.swiglu:
        for tp_rank in range(tp_size):
            mlp_l0_weight[tp_rank] = torch.chunk(mlp_l0_weight[tp_rank], 2, dim=0)
        mlp_l0_weight_w_list[expert_idx] = torch.cat([w[0] for w in mlp_l0_weight], dim=0)
        mlp_l0_weight_v_list[expert_idx] = torch.cat([w[1] for w in mlp_l0_weight], dim=0)
    else:
        mlp_l0_weight_list[expert_idx] = torch.cat(mlp_l0_weight, dim=0)
    mlp_l1_weight_list[expert_idx] = torch.cat(mlp_l1_weight_list[expert_idx], dim=1)

# Stack along the expert parallel dimension
if md.swiglu:
    message["mlp l0 weight W"] = torch.stack(mlp_l0_weight_w_list)
    message["mlp l0 weight V"] = torch.stack(mlp_l0_weight_v_list)
    for tp_rank in range(tp_size):
        shared_expert_mlp_l0_weight[tp_rank] = torch.chunk(shared_expert_mlp_l0_weight[tp_rank], 2, dim=0)
    message["shared mlp l0 weight W"] = torch.cat([w[0] for w in shared_expert_mlp_l0_weight], dim=0)
    message["shared mlp l0 weight V"] = torch.cat([w[1] for w in shared_expert_mlp_l0_weight], dim=0)
else:
    message["mlp l0 weight"] = torch.stack(mlp_l0_weight_list)
    message["shared mlp l0 weight"] = torch.cat(shared_expert_mlp_l0_weight, dim=0)
message["shared mlp l1 weight"] = torch.cat(shared_expert_mlp_l1_weight, dim=1)
message["mlp l1 weight"] = torch.stack(mlp_l1_weight_list)

# Concat along TP and stack along EP for the biases
if md.linear_bias:
    mlp_l0_bias_w_list = [[] for _ in range(margs.num_experts)]
    mlp_l0_bias_v_list = [[] for _ in range(margs.num_experts)]
    # Concat along the tensor parallel dimension
    for expert_idx in range(margs.num_experts):
        mlp_l0_bias = mlp_l0_bias_list[expert_idx]
        if md.swiglu:
            for tp_rank in range(tp_size):
                mlp_l0_bias[tp_rank] = torch.chunk(mlp_l0_bias[tp_rank], 2, dim=0)
            mlp_l0_bias_w_list[expert_idx] = torch.cat([w[0] for w in mlp_l0_bias], dim=0)
            mlp_l0_bias_v_list[expert_idx] = torch.cat([w[1] for w in mlp_l0_bias], dim=0)
        else:
            mlp_l0_bias_list[expert_idx] = torch.cat(mlp_l0_bias, dim=0)
        assert len(mlp_l1_bias_list[expert_idx]) == 1
        mlp_l1_bias_list[expert_idx] = mlp_l1_bias_list[expert_idx][0]

    # Stack along the expert parallel dimension
    if md.swiglu:
        message["mlp l0 bias W"] = torch.stack(mlp_l0_bias_w_list)
        message["mlp l0 bias V"] = torch.stack(mlp_l0_bias_v_list)
    else:
        message["mlp l0 bias"] = torch.stack(mlp_l0_bias_list)
    message["mlp l1 bias"] = torch.stack(mlp_l1_bias_list)

# Simple concat of the rest
message["qkv weight"] = torch.cat(qkv_weight, dim=0)
message["dense weight"] = torch.cat(dense_weight, dim=1)
if md.linear_bias:
    message["qkv bias"] = torch.cat(qkv_bias, dim=0)
elif md.add_qkv_bias:
    message["qkv bias"] = torch.cat(qkv_bias, dim=0)

# Do nothing to router
message["router weight"] = router_weight
message["shared gate weight"] = shared_expert_gate_weight
```
Could you refactor this block of code with a clearer structure and reuse the duplicated code for better maintainability? Something along the lines of the sketch below.
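For illustration only, one possible shape for such a refactor: the SwiGLU split-and-concat that currently appears for both the routed experts and the shared expert could live in a single helper. The function names below are hypothetical, and the sketch assumes the per-rank shard lists collected in the loop above.

```python
import torch

def concat_tp(shards, dim):
    """Concatenate per-TP-rank shards of one weight along `dim`."""
    return torch.cat(shards, dim=dim)

def split_swiglu_then_concat_tp(fc1_shards):
    """Each TP shard of a SwiGLU fc1 stores [W_rank; V_rank] on dim 0.
    Split every shard into its W/V halves, then concatenate each half across ranks."""
    halves = [torch.chunk(shard, 2, dim=0) for shard in fc1_shards]
    w = torch.cat([h[0] for h in halves], dim=0)
    v = torch.cat([h[1] for h in halves], dim=0)
    return w, v
```

The per-expert loop and the shared-expert branch would then each become a single call, and the bias path could reuse the same helper since it follows an identical pattern.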
done
Force-pushed from 0886d56 to 87dd51b (compare)
Force-pushed from 87dd51b to 2a07758 (compare)
Hi, thanks for your contribution. We actually already have HF->MCore conversion for non-MoE Qwen 2 and 2.5, but it's a little hidden because it lives here: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/checkpoint/loader_llama_mistral.py, and its usage is documented in the repo as well. Perhaps you could add the MoE support to what we already have, and then we can look at merging your contribution.
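If it helps, adding MoE support to the existing loader would largely come down to gathering the per-expert HF tensors into the grouped layout shown earlier in this thread. The sketch below is only a rough outline under assumptions: the helper name is made up, and the HF parameter names follow the usual Qwen2-MoE convention (model.layers.{L}.mlp.experts.{E}.gate_proj/up_proj/down_proj), which should be verified against the actual checkpoint.

```python
import torch

def stack_hf_experts_for_layer(hf_state_dict, layer_idx, num_experts):
    """Collect the per-expert HF weights of one decoder layer and stack them
    along a new leading expert dimension (assumed HF names; verify first)."""
    gate, up, down = [], [], []
    for e in range(num_experts):
        prefix = f"model.layers.{layer_idx}.mlp.experts.{e}"
        gate.append(hf_state_dict[f"{prefix}.gate_proj.weight"])
        up.append(hf_state_dict[f"{prefix}.up_proj.weight"])
        down.append(hf_state_dict[f"{prefix}.down_proj.weight"])
    # With SwiGLU, the expert fc1 fuses the gate and up projections on dim 0;
    # fc2 corresponds to down_proj.
    fc1 = torch.stack([torch.cat([g, u], dim=0) for g, u in zip(gate, up)])
    fc2 = torch.stack(down)
    return fc1, fc2
```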
usage example: examples/qwen/README.md